Improve listing performance of Hudi tables #17084
arunthirupathi merged 1 commit into prestodb:master
4832f4d to 109ff66
vinothchandar left a comment
Minor comments. Looking in good shape!
We should look at cleaning up the path filter altogether, if this is just always true? Follow-up PR?
Yes. I kept it for compatibility, in case users disable HudiDirectoryLister and do not use it. But actually, we introduced the path filter in #13818. Let's remove it in a follow-up PR.
 <dep.druid.version>0.19.0</dep.druid.version>
 <dep.jaxb.version>2.3.1</dep.jaxb.version>
-<dep.hudi.version>0.9.0</dep.hudi.version>
+<dep.hudi.version>0.10.0</dep.hudi.version>
This needs to be called out in the PR summary?
-private static boolean shouldUseFileSplitsForHudi(InputFormat<?, ?> inputFormat, Configuration conf, String tablePath)
+private static boolean shouldUseFileSplitsForHudi(InputFormat<?, ?> inputFormat, Optional<HoodieTableMetaClient> metaClient)
 {
     if (inputFormat instanceof HoodieParquetRealtimeInputFormat) {
Isn't this check the same as !isHudiParquetInputFormat(inputFormat) above? Any chance to simplify?
HoodieParquetRealtimeInputFormat is a subclass of HoodieParquetInputFormat. However, we want to use file splits for the former but not the latter. Hence the separate check.
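To illustrate why the two checks are not interchangeable, here is a minimal sketch. The class names BaseFormat, RealtimeFormat, OtherFormat and SplitDecision are hypothetical stand-ins, and it assumes the existing helper is an instanceof test against the parent class; only the inheritance relationship is taken from the comment above.

```java
// Hypothetical stand-ins for the Hudi input formats. Only the inheritance
// relationship (the realtime format extends the base format) comes from the thread.
class BaseFormat {}

class RealtimeFormat extends BaseFormat {}

class OtherFormat {}

final class SplitDecision
{
    private SplitDecision() {}

    // Assumed shape of the existing helper: true for the base format
    // and for every subclass, including RealtimeFormat.
    static boolean isBaseFormat(Object inputFormat)
    {
        return inputFormat instanceof BaseFormat;
    }

    // True only for the realtime subclass.
    static boolean isRealtimeFormat(Object inputFormat)
    {
        return inputFormat instanceof RealtimeFormat;
    }

    public static void main(String[] args)
    {
        Object realtime = new RealtimeFormat();
        // The realtime format passes both checks, so negating the base
        // check cannot be used to single it out.
        System.out.println("isBaseFormat:     " + isBaseFormat(realtime));
        System.out.println("isRealtimeFormat: " + isRealtimeFormat(realtime));
        System.out.println("!isBaseFormat:    " + !isBaseFormat(realtime));
    }
}
```

Under these assumptions, a realtime instance satisfies the parent check as well, so the negated parent check would wrongly exclude it.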
String partition = FSUtils.getRelativePartitionPath(new Path(tablePath), directory);
if (fileStatuses.isPresent()) {
    fileSystemView.addFilesToView(fileStatuses.get());
    this.hoodieBaseFileIterator = fileSystemView.fetchLatestBaseFiles(partition).iterator();
Can we just call the same API here, getLatestBaseFiles(), and simplify? I think getLatestBaseFiles() calls fetchLatestBaseFiles().
getLatestBaseFiles() also calls ensurePartitionLoadedCorrectly(), which adds files to the fs view. Since we have already done that in this case, we call fetchLatestBaseFiles() directly.
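The trade-off can be sketched with a simplified view class. SketchFileSystemView is hypothetical and is not the Hudi class; it borrows the method names only for readability. It also assumes, purely for illustration, that the load check tracks partitions separately from files added by hand.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified file-system view; not the Hudi implementation.
final class SketchFileSystemView
{
    private final Map<String, List<String>> filesByPartition = new HashMap<>();
    // Assumption: partitions loaded via listing are tracked separately.
    private final Set<String> loadedPartitions = new HashSet<>();
    private int listingCalls;

    // The caller already holds a file listing and hands it to the view.
    void addFilesToView(String partition, List<String> files)
    {
        filesByPartition.computeIfAbsent(partition, key -> new ArrayList<>()).addAll(files);
    }

    // Reads only what is already in the view; never lists storage.
    // (Picking the latest version per file group is omitted in this sketch.)
    List<String> fetchLatestBaseFiles(String partition)
    {
        return filesByPartition.getOrDefault(partition, List.of());
    }

    // Ensures the partition is loaded first, then reads from the view.
    List<String> getLatestBaseFiles(String partition)
    {
        ensurePartitionLoaded(partition);
        return fetchLatestBaseFiles(partition);
    }

    private void ensurePartitionLoaded(String partition)
    {
        if (loadedPartitions.add(partition)) {
            listingCalls++; // stands in for an expensive storage listing
        }
    }

    int listingCalls()
    {
        return listingCalls;
    }
}
```

With files already added by the caller, the direct fetch returns them without triggering the listing step, which is the behaviour the reply describes wanting.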
109ff66 to 6603580
arunthirupathi left a comment
Can this PR contain some unit tests instead of manual verification?
Is this method an override? If not, why does it return an Optional when the underlying value is always present?
Yeah, optional is not needed here. I will remove it.
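A minimal sketch of the point being made, using a hypothetical ListerSettings class (not code from this PR): when a value can never be absent, wrapping it in Optional only forces callers to unwrap it.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical holder class, used only to illustrate the review comment.
final class ListerSettings
{
    private final String tablePath;

    ListerSettings(String tablePath)
    {
        // The value is validated once, so it is always present afterwards.
        this.tablePath = Objects.requireNonNull(tablePath, "tablePath is null");
    }

    // Before: the Optional can never be empty, so it carries no information.
    Optional<String> getTablePathOptional()
    {
        return Optional.of(tablePath);
    }

    // After: return the value directly.
    String getTablePath()
    {
        return tablePath;
    }
}
```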
6603580 to b4049d4
Hi @arunthirupathi, thanks for reviewing the PR. I have added a test for
Why is the HBase dependency required?
HBase is a pretty big dependency and this could cause issues.
Hudi uses HBase in two ways:
- The metadata files are written as HFile.
- To maintain an index of Hudi filegroup id to external bootstrap file.
In this PR, we just need the Bytes class from hbase-common for initializing HoodieTableFileSystemView, and I have excluded the deps that we don't need. We have created a thinner hudi-presto-bundle, which I plan to update in the presto-hive pom in a subsequent PR.
Currently Presto does not pull in other large open source jars (hadoop, hive). The only way to do so is to fork the code, reduce it to the bare minimum, and then take a dependency on it.
https://github.com/prestodb/presto-hadoop-apache2
https://github.com/prestodb/presto-hive-apache
HBase will cause dependency hell for Presto, as some of the dependencies conflict. https://mvnrepository.com/artifact/org.apache.hbase/hbase-common/2.4.8
This will be a blocker to merge this PR.
Even the linked PR shades everything, which will avoid dependency issues but will cause the jar size to explode.
What is the size of the shaded jar?
It's about 13 MB.
Unfortunately, HFile is only packaged as part of HBase. Would it be preferable to use hudi-presto-bundle instead of adding hbase directly?
@arunthirupathi Can you please review this again? Check the latest commit, where I'm using the stripped-down hudi-presto-bundle.
For presto-hive, can you please do a dependency tree diff of before and after and paste what changed?
mvn dependency:tree -pl presto-hive (run this on master and on the change branch)
Below is the diff. The full dependency tree is in this gist.
sagars@Sagars-MacBook-Pro presto % git diff --no-index presto_hive_master_dep.txt presto_hive_hudi_dep.txt
diff --git a/presto_hive_master_dep.txt b/presto_hive_hudi_dep.txt
index 3f6e69f26a..21733b24f4 100644
--- a/presto_hive_master_dep.txt
+++ b/presto_hive_hudi_dep.txt
@@ -20,8 +20,7 @@
[INFO] +- com.facebook.presto:presto-expressions:jar:0.268-SNAPSHOT:compile
[INFO] +- com.facebook.presto:presto-cache:jar:0.268-SNAPSHOT:compile
[INFO] | \- org.alluxio:alluxio-shaded-client:jar:2.7.0:compile
-[INFO] +- org.apache.hudi:hudi-common:jar:0.9.0:compile
-[INFO] +- org.apache.hudi:hudi-hadoop-mr:jar:0.9.0:compile
+[INFO] +- org.apache.hudi:hudi-presto-bundle:jar:0.10.0:compile
[INFO] +- com.facebook.presto:presto-memory-context:jar:0.268-SNAPSHOT:compile
[INFO] +- com.facebook.presto:presto-hive-common:jar:0.268-SNAPSHOT:compile
[INFO] +- com.facebook.presto:presto-hive-metastore:jar:0.268-SNAPSHOT:compile
@@ -249,6 +248,6 @@
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
-[INFO] Total time: 2.809 s
-[INFO] Finished at: 2021-12-23T12:51:07+05:30
+[INFO] Total time: 2.857 s
+[INFO] Finished at: 2021-12-23T12:52:44+05:30
[INFO] ------------------------------------------------------------------------
fa6e417 to a25fee9
arunthirupathi left a comment
Left one comment, but the rest looks good.
Why is the exclusion required here and in other directories?
Without the exclusion, the Maven duplicate finder plugin complains: "Found duplicate (but equal) classes in [dependency A, dependency B]."
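For readers unfamiliar with this kind of check, the sketch below shows the general idea of flagging classes that appear in two artifacts and telling equal copies from differing ones. It is not the plugin's implementation; DuplicateClassSketch, its inputs (class name mapped to a content digest), and the sample names are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a duplicate-class check; not the Maven plugin's code.
final class DuplicateClassSketch
{
    enum Kind { EQUAL, DIFFERENT }

    private DuplicateClassSketch() {}

    // Each input maps a fully qualified class name to a digest of its bytes
    // for one artifact. The result lists class names present in both.
    static Map<String, Kind> findDuplicates(Map<String, String> artifactA, Map<String, String> artifactB)
    {
        Map<String, Kind> duplicates = new TreeMap<>();
        for (Map.Entry<String, String> entry : artifactA.entrySet()) {
            String otherDigest = artifactB.get(entry.getKey());
            if (otherDigest != null) {
                // Same name in both artifacts: equal bytes or conflicting bytes.
                Kind kind = otherDigest.equals(entry.getValue()) ? Kind.EQUAL : Kind.DIFFERENT;
                duplicates.put(entry.getKey(), kind);
            }
        }
        return duplicates;
    }

    public static void main(String[] args)
    {
        Map<String, String> bundle = Map.of("org.example.Shared", "abc", "org.example.Changed", "111");
        Map<String, String> other = Map.of("org.example.Shared", "abc", "org.example.Changed", "222");
        System.out.println(findDuplicates(bundle, other));
    }
}
```

Equal copies are usually harmless at runtime, while differing copies mean the class that wins depends on classpath order.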
Does the presto-hudi bundle contain a copy of this class, but unshaded?
Yeah. I just created a PR to shade them: apache/hudi#4495
When will the new presto-hudi bundle be available? Can you use the new presto-hudi bundle instead of this one?
@arunthirupathi We plan a 0.10.1 minor release in Jan. Can we merge this and follow up after? Do you see any issues with this exclusion?
I don't see any, provided you folks make an effort to release this in Jan and remove the exclusion later. As usual, I will revert at the first sign of trouble.
Once the tests pass, I will merge this change.
a25fee9 to 7a65d9e
arunthirupathi left a comment
There are failing tests. Can you please rebase and squash the commits?
ff77c1d to fe74619
Done. But it looks like there is one flaky test -
For the failing tests, rebase and push a couple of times. If it still fails after two attempts, let me know. I can take a look and either disable the test or reach out to the author to fix it.
fe74619 to 716250a
- Integrate metadata-based listing for Hudi tables. This is enabled by a session property.
- Implement a new DirectoryLister for Hudi that uses HoodieTableFileSystemView to fetch data files.
- Bump Hudi version to 0.10.0 to use above features.

Add unit tests for HudiDirectoryLister

Replace hudi-common and hudi-hadoop-mr by hudi-presto-bundle
716250a to 618aa95
First, I apologize for the issue caused. I was concerned about the unshaded classes, but all the tests passed, so I thought this was not an issue. Can you please share more details about the errors you are running into? I will take a look and see if I can add more tests to exercise them. Please give as much detail as possible, to prevent future regressions.
Hi Sagar, Sumit. Current conflicts in Facebook's installation of Presto: Found duplicate and different classes in [com.facebook.presto.hive:hive-apache:3.0.0-7, org.apache.hudi:hudi-presto-bundle:0.10.0]:
Hmmm. Not sure if we can shade parquet-avro, but we could skip bundling it with the hudi-presto-bundle. We will take a closer look at the hudi-presto-bundle; we have a 0.10.1 upcoming. Apologies for your trouble!
Test plan -
Synced a Hudi table to Hive and queried through Presto
with both metadata enabled and disabled.